Abstract Click-through data has been used in various
ways in Web search, such as estimating relevance between
documents and queries. Since users perceive only search
snippets before issuing any clicks, the relevance induced
by clicks is usually called perceived relevance, which has
proven to be quite useful for Web search. While there is
plenty of click data for popular queries, very little
information is available for unpopular tail ones. These
tail queries account for a large portion of the search
volume, but search accuracy for them is usually
unsatisfactory due to data sparseness such as limited
click information. In this paper, we study the problem of
modeling perceived relevance for queries without
click-through data. Instead of relying on users' click
data, we carefully design a set of snippet features and
use them to approximately capture the perceived
relevance. We study the effectiveness of this set of
snippet features in two settings: (1) predicting perceived
relevance and (2) enhancing search engine ranking.
Experimental results show that our proposed model is
effective at predicting the relative perceived relevance
of Web search results. Furthermore, our proposed snippet
features are effective at improving search accuracy for
longer tail queries without click-through data.

1 Introduction 

Designing effective ranking functions to satisfy all kinds
of information needs of end users is admittedly difficult.
A practical approach used by commercial search engines
is to collect all possibly useful signals or features and
combine them using techniques such as learning to rank
[71157115]. Beyond the regular text matching
features (e.g., TF-IDF), click-through data has been
studied extensively p2l[r2l[2l[2Tl[2^]. Notable uses
of click-through data include propagating semantic
information between queries and documents [ M2"o1[Ml H],
estimating document relevance [22, 12, 16, 36, 10],
defining features in learning to rank [2, 21], etc. Existing
work has demonstrated the unique value of click-through
data in improving search engines from many
perspectives. A main advantage of click-through data
is that it contains users' implicit perceived relevance
feedback.

A well-known challenge in leveraging click-through
data is that click-through information is very noisy and
biased by many factors such as presentation order and
appearance {2"51[TT1I55]. Many studies such as [T2"lll31lll)l
[TUll2"51l2"0] have attempted to address the position bias
in order to extract the relevance between documents and
queries that is hidden in the clicks. Since users perceive
only search snippets before issuing any clicks, the
relevance induced by clicks is usually called perceived
relevance. In general, when there is sufficient click
information for a query, existing approaches can estimate
the perceived relevance reliably, which has proven
to be quite effective for improving Web search.

While there is plenty of click information for popular
or head queries, very little information is available for
tail ones, and existing methods either cannot be applied
to tail queries or give unreliable estimates for them due
to limited click data. According to a
recent study [3T], queries submitted to Web search engines
follow a heavy-tailed power-law distribution. Thus
a large fraction of queries are issued very infrequently,
forming the well-known "long tail" [3]. Naturally, the
useful signals for such tail queries are very scarce in
search logs. As a result, click-through data mainly
benefits popular head queries, and current
search engines usually perform poorly on tail queries [14].
A recent study [18] shows that almost every individual
user has both head and tail requests for Web search.
Thus, poor search results on tail queries not only
leave most users dissatisfied with their immediate
requests, but also worsen their overall perception
of a search engine. Interestingly, [T5] has also shown
that there is a second-order effect: satisfactory results
for tail requests can significantly boost the head
requests due to increased user satisfaction and the resulting
repeat patronage. However, search accuracy for tail
queries is usually unsatisfactory due to data sparseness.
Thus it remains a challenge to improve search quality
for tail queries.

The importance and uniqueness of tail queries have
been noticed recently. Existing work on tail queries
mainly focuses on aspects such as query classification [5],
query advertisability [27], and query suggestions [32].
Surprisingly, there is little work on directly improving
search accuracy for tail queries, which is the most
important aspect of a search engine. In this paper, we
propose a self-reinforcement approach for tail queries.
Motivated by the perceived relevance in click-through data,
our main idea is to capture the perceived relevance
based on search result snippets without requiring any
click-through data. Search result snippets are valuable
resources for the following reasons: (1) Search result
snippets are highly correlated with click data and thus
with the underlying perceived relevance. (2) The snippets
are summaries of the documents, consisting of the passages
deemed most relevant by the snippet generation methods.
Passage-level relevance |Sj can be modeled by matching
queries with search snippets.

Specifically, we define a set of snippet features whose 
goal is to capture the perceived relevance from multiple 
perspectives, including language attractiveness, URL 
attractiveness, and query-snippet matching attractiveness.
None of these features needs any user click
data; all can be computed solely from queries and
snippets. We study the effectiveness of this set of
snippet features in two settings: (1) predicting perceived
relevance and (2) enhancing search engine ranking. For 
(1), we first estimate perceived relevance for queries 
which have sufficient clicks using an existing dynamic 
Bayesian network model. We then train a machine learn- 
ing model to predict the estimated perceived relevance. 
For (2), however, it is not straightforward to incorporate
these features into a search process since most of
the features can be computed only after the query-dependent
snippets are generated. We thus propose two
strategies to leverage these snippet features. Our first
strategy is to combine the predicted perceived relevance
scores with the original ranking scores to rerank search 
results. Our second strategy is to expand the original 
ranking features by adding the snippet features to learn 
a new ranking function. We show that both strategies 
can be naturally incorporated into a search process in 
different application scenarios. 

We evaluate the usefulness of our defined snippet
features based on a large set of queries and snippet
features from a commercial search engine. Experimental
results show that the defined snippet features give
good predictions of perceived relevance and can also
improve the search accuracy significantly.

2 Related Work 

The long tail view was first coined in [3] and has been
observed in many diverse applications such as e-commerce
and Web search [TB]. Our work is more related to the
long tail study in Web search. For example, [14] compared
head queries and tail queries in terms of search
accuracy and users' search behaviors. [B] proposed robust
algorithms for rare query classification. [37] studied
the advertisability of tail queries in sponsored search
and proposed a word-based approach for efficient online
computation. [33] studied query suggestions for
rare queries, but their approaches still assume that there
is click information to leverage. In contrast, our work
directly improves the search accuracy, which is
the most important aspect of a search engine, for tail
queries without any click-through data.

In the past, snippets have been used for many different
purposes such as query classification [BJ and measuring
query similarity [30]. In particular, our work is
related to [5]. In [5], some snippet features, such as the
overlap between the words in the title and in the query,
are used together with user behavior and click-through
features. The main finding of their study is that click
features are the most useful for general queries. In our
work, we focus on tail queries which do not have any click
information. We define a more comprehensive set of snippet
features and discuss different application scenarios to
efficiently leverage these snippet features.

Our work is related to click models, and a number
of recent studies have been conducted to analyze click
data [2^1IT21I3B1[TU1[TB1[2^]. For example, [22] examined
several rule-based methods to extract the relative
preference between a pair of documents from click data.
Recently, all clicks in a search session are modeled together
so that the dependency among clicks at different positions
can be captured. For example, the cascade model [T5]
assumes that users examine results sequentially and stop
as soon as a relevant document is clicked. [TU] and [TU]
analyze click data based on different Bayesian generative
models, and perceived relevance is estimated by
fitting the models to observed click data. [20] further
extends these models to consider intent diversity. A recent
approach [TU] uses a session utility model to estimate
the "intrinsic relevance" of each clicked document.
Both [TU] and [TU] discussed the difference between
"perceived relevance" and "true relevance." The main
resources they relied on to estimate the true relevance of a
click are the session activities after the click. Usually
a document which is clicked last is given a higher relevance
score. In our work, we choose to model perceived
relevance since there is no actual click information
for tail queries and it is hard to model what happened
afterwards. Furthermore, all these works leverage only
the click information and have not considered search
result snippets.

Our work is also related to click prediction works [TJ
l28l[T5]. [TJ used an existing hierarchy to propagate clicks
to rare events. [15] used past clicks to predict future clicks.
[28] proposed a feature-based method of predicting the
click-through rate for new ads. To the best of our knowledge,
little work has been done on predicting click-based
perceived relevance for tail queries in Web search.
Furthermore, compared with [28], which only uses
query-dependent features, we explore a more comprehensive
feature set with both query-dependent and query-independent
features.

3 Perceived Relevance and User Clicks 

Click-through data has been extensively studied recently
P2l[T2l[T0l[2TJ]. A common observation is that click data
contains users' perceived relevance feedback and that this
information is quite effective for improving Web search.
However, click data is noisy and biased by many factors
such as presentation order and appearance [23KHI3S].
Many studies such as [121IT91IT01I23] have attempted to
address the position bias in order to extract the relevance
between documents and queries that is hidden in the
clicks. Technically, perceived relevance is usually captured
by the click probability given that the corresponding
search result has been examined by the user. By definition,
perceived relevance is independent of position
bias. In the following, we give a brief review of the Dynamic
Bayesian Network (DBN) model, which can effectively
extract the perceived relevance from a click
session [TO].

The DBN model is based on the cascade model proposed
in [12]. The cascade model assumes that a user
examines the search results sequentially from top to
bottom and decides whether to click each search result.
After a document u is examined, it is either clicked
with probability $a_u$ or skipped with probability $(1 - a_u)$,
where $a_u$ denotes the degree of attractiveness or perceived
relevance. The cascade model assumes that a
user who clicks never comes back and a user who skips
always continues. A click on the i-th document means
that the user skipped all the documents ranked above it
and is satisfied by the i-th document:

$$P(C_i = 1) = a_i \prod_{u=1}^{i-1} (1 - a_u)$$

These assumptions clearly oversimplify the problem.
In particular, the model can only
consider sessions with exactly one click.
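
As a minimal sketch of the cascade click probability above, the following Python function multiplies the attractiveness of the clicked position by the skip probabilities of all higher-ranked results. The function name and the example attractiveness values are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def cascade_click_probability(attractiveness: Sequence[float], i: int) -> float:
    """Return P(C_i = 1) under the cascade model; i is a 1-based position."""
    if not 1 <= i <= len(attractiveness):
        raise ValueError("position i must lie within the result list")
    # The user clicks position i with probability a_i ...
    prob = attractiveness[i - 1]
    # ... after skipping every result ranked above it.
    for a_u in attractiveness[: i - 1]:
        prob *= 1.0 - a_u
    return prob


# Hypothetical attractiveness values for a three-result page.
example_attractiveness = [0.5, 0.4, 0.2]
example_probs = [
    cascade_click_probability(example_attractiveness, i) for i in (1, 2, 3)
]
```

Because the model allows at most one click per session, the per-position probabilities sum to at most 1; the remainder is the probability that the session ends without a click.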

[TO] extends the cascade model and proposes a Dynamic
Bayesian Network (DBN) model to simultaneously
model the relevance of all documents in the search
results. The DBN model introduces the notion of satisfaction
to separately model the relevance of the landing
page and the perceived relevance on the search results
page (attractiveness). Formally, we use binary random
variables $E_i$, $A_i$, $S_i$ and $C_i$ to denote examination,
attractiveness, satisfaction, and click of the i-th document. A
session is generated by the following procedure, assuming
$E_1 = 1$ and all other default values are 0:

- For each position i, sample an attractiveness probability
  $a_i$ from a Beta prior distribution.

- For each position i, sample a satisfaction probability
  $s_i$ from a Beta prior distribution.

- Repeat for each position i:

  - Sample $A_i = 1$ with probability $a_i$. Set $C_i = A_i$
    if $E_i = 1$.

  - If $C_i = 1$, sample $S_i = 1$ with probability $s_i$.
    Otherwise set $S_i = 0$.

  - If $S_i = 1$, set $E_{i+1} = 0$. Otherwise, sample
    $E_{i+1} = 1$ with probability $\gamma$.

The parameter $\gamma$ is the perseverance parameter: a
user may give up the search with probability $1 - \gamma$ before
being satisfied. Assuming users always examine the first
position, attractiveness is in effect a prediction of the CTR
at position 1. Given a set of click sessions in search logs,
we can find a maximum a posteriori (MAP) estimate
of $a_i$ and $s_i$ by an EM algorithm [TU]. The obtained $a_i$
denotes the degree of attractiveness of the i-th document.
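
The generative procedure above can be sketched as a small simulator. This is only an illustration under stated assumptions: the attractiveness and satisfaction probabilities are passed in directly rather than sampled from a Beta prior, examination is assumed never to resume once it has stopped, and the function name `simulate_dbn_session` is hypothetical. It does not implement the EM estimation.

```python
import random
from typing import List, Sequence


def simulate_dbn_session(
    a: Sequence[float],
    s: Sequence[float],
    gamma: float,
    rng: random.Random,
) -> List[int]:
    """Simulate the click vector (C_1, ..., C_n) of one DBN session."""
    if len(a) != len(s):
        raise ValueError("a and s must have one entry per position")
    clicks: List[int] = []
    examined = True  # E_1 = 1: the first position is always examined
    for a_i, s_i in zip(a, s):
        if not examined:
            # Assumption: once the user stops examining, no later clicks occur.
            clicks.append(0)
            continue
        clicked = rng.random() < a_i  # A_i = 1 w.p. a_i, and C_i = A_i when E_i = 1
        clicks.append(int(clicked))
        satisfied = clicked and rng.random() < s_i  # S_i = 1 only after a click
        if satisfied:
            examined = False  # a satisfied user stops: E_{i+1} = 0
        else:
            examined = rng.random() < gamma  # perseverance: E_{i+1} = 1 w.p. gamma
    return clicks
```

With `gamma = 1` and all satisfaction probabilities equal to 1, the simulator reduces to the cascade behavior of stopping at the first click.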

In general, when there is sufficient click information
for a query, existing approaches such as DBN can
estimate the perceived relevance reliably. However,
existing click-based methods rely only on the click
information in search logs and ignore the search
snippets entirely. While there is plenty of click information
for popular or head queries, very little
information is available for tail ones, and existing methods
either cannot be applied to tail queries or give unreliable
estimates for them due to scarce click data. In
the next section, we describe our approach to capturing
perceived relevance using search snippets.

4 Capture Perceived Relevance for Tail Queries 

Tail queries pose a big challenge for click-based
methods. Since the ultimate goal of the click-based
methods is to capture the perceived relevance,
and search result snippets are the main information
source before a user issues a click, we try to capture
perceived relevance for tail queries based on search
snippets in this section.

4.1 A Motivating Experiment 

Our hypothesis is that there is a strong correlation be- 
tween perceived relevance and snippets in a Web search 
result page. We test this hypothesis using a simple experiment
as follows. We collected a
set of tuples $(q, u_1, u_2)$ where $u_1$ and $u_2$ are any two
URLs that appear in the same search result page for
the query q during some period of time. We then computed
the number of query tokens missing from each title and
examined how likely $u_1$ is to be clicked more frequently than
$u_2$ when $u_1$ has fewer missing query tokens than $u_2$. In
other words, we want to estimate the probability

$$P_0 = \mathrm{Prob}(u_1 \text{ is clicked more frequently than } u_2 \mid \mathrm{miss}(t_{u_1}) < \mathrm{miss}(t_{u_2}))$$

where $\mathrm{miss}(t_{u_i})$ is the number of missing query tokens
in the title $t_{u_i}$ for $u_i$. We balanced our samples to eliminate
potential position bias by ensuring that $u_1$ is presented
higher than $u_2$ in half of our examples. The estimate
of $P_0$ in our data was 0.74, which is much
larger than 0.5 and shows a positive correlation. Furthermore,
we observed a stronger click preference when the difference
in title matching between the two URLs is larger:

$$P_1 = \mathrm{Prob}(u_1 \text{ is clicked more frequently than } u_2 \mid \mathrm{miss}(t_{u_1}) + 1 < \mathrm{miss}(t_{u_2})) = 0.83.$$
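
The two probabilities above could be estimated from logged URL pairs roughly as follows. This is a sketch under assumptions: the record layout, the whitespace tokenizer, and the toy click counts are all hypothetical, and the position-balancing step described in the text is not implemented here.

```python
from typing import Iterable, Optional, Tuple

# Assumed record layout: (query, title of u1, clicks on u1, title of u2, clicks on u2).
PairRecord = Tuple[str, str, int, str, int]


def miss(query: str, title: str) -> int:
    """Number of query tokens that do not appear in the title."""
    title_tokens = set(title.lower().split())
    return sum(1 for tok in query.lower().split() if tok not in title_tokens)


def click_preference(records: Iterable[PairRecord], margin: int = 0) -> Optional[float]:
    """Fraction of pairs with miss(t_u1) + margin < miss(t_u2) in which u1
    received more clicks than u2; margin=0 mirrors P_0 and margin=1 mirrors P_1."""
    qualifying = 0
    preferred = 0
    for query, title1, clicks1, title2, clicks2 in records:
        if miss(query, title1) + margin < miss(query, title2):
            qualifying += 1
            if clicks1 > clicks2:
                preferred += 1
    return preferred / qualifying if qualifying else None


# Toy data, invented purely for illustration.
toy_records = [
    ("cheap flights", "Cheap Flights Online", 30, "Travel Deals", 10),
    ("cheap flights", "Cheap Hotels", 5, "Airline News", 9),
    ("cheap flights", "Airline News", 4, "Cheap Flights", 20),
]
```

On the toy records, only the first two pairs satisfy the margin-0 condition, and only the first satisfies the stricter margin-1 condition.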

This result demonstrates that snippets with more
missing query tokens in their titles tend to receive fewer
clicks than those with fewer missing query tokens. This
makes sense intuitively, since a page whose title matches
the query well is more likely to be relevant.
Note that the title matching is only a single feature 
among many possible signals that may influence user 
clicks. To model the click behaviors more accurately,
we need to seek a more comprehensive set of such snippet
features to capture the perceived relevance more
precisely.

Feature              Description

Language Attractiveness Features
  Readability Features
    NumChars         Number of characters in snippet
    NumWords         Number of words in snippet
    NumSegments      Number of period/ellipsis-separated segments
    NumWordInitCap   Number of words with initial capitals in snippet
    FracWordInitCap  Fraction of words with initial capitals in snippet
    NumCapChar       Number of capital characters in title or URL
    FracCapChar      Fraction of capital characters in title or abstract
  Word-level Attractiveness
    FracAttrWord     Fraction of attractive words

URL Attractiveness Features
    NumChars         Number of characters in URL
    TopLevelDomain   The top level domain of URL
    NumLevelDomain   Number of levels in domain
    NumViews         Number of views (impressions) of URL

Matching Attractiveness Features
    NumMatch         Number of all matches in snippet
    NumUniqMatch     Number of unique matches in snippet
    NumApxMatch      Number of approximate matches in snippet
    FracMatch        Fraction of matches in snippet
    FracApxMatch     Fraction of approximate matches in snippet
    NumBefMatch      Number of words before the first match
    NumBtwMatch      Number of extra words between matches
    IsExactMatch     Is the whole query string exactly matched
    IsOrderMatch     Are the matches in the exact order
    IsSegMatch       Do all matches occur in a single segment

Table 1 Summary of the snippet features.

4.2 Search Snippet Features 

Our goal is to develop a comprehensive set of snippet 
features that capture the attractiveness of results. We 
define our features from the following perspectives: language
attractiveness, URL attractiveness, and query-snippet
matching attractiveness. All the features are
summarized in Table 1 and we describe them separately
in the following.

4.2.1 Language Attractiveness.

We model the language attractiveness by two sets of
features: readability and word-level attractiveness. (1)
Recent studies such as [TT] show that the readability
of snippets in a search result page can directly impact
users' click-through behavior. In this work, we define
some readability features similar to those proposed in
[T51l2T)ll2l)], along with some new features based on our
intuitive judgments and experiments. This set of features
mainly models the syntactic information of the titles
and abstracts of the snippets. For example, the feature
NumSegments measures the number of fragments separated
by an ellipsis or a period in abstracts, which in
some sense reflects how easily the snippets can be read.
(2) The word-level attractiveness models the language
at a semantic level. Previous research [TTll2"51
[T5Il2"5] also shows that some terms in titles (e.g., "official"
or "gallery") specify a certain genre and influence
user clicks noticeably. To identify these words,
we use a t-test based on the URL attractiveness values
estimated by the DBN model. Specifically, given
head queries with attractiveness of URLs estimated by
DBN, we form two sets of titles, A and U, where A includes
the titles of the two most attractive URLs and U
includes the titles of the two most unattractive URLs
of every query. An attractive word will have higher
discriminative power between A and U, and a less attractive
word will have a smaller difference between A
and U. For each word w, we perform a t-test on the
mean difference between $w_A = \{I(w \in T) \mid T \in A\}$
and $w_U = \{I(w \in T) \mid T \in U\}$, where I is an indicator
function. Table 2 shows some examples of attractive
words identified by our test with p-value < 0.05. Intuitively,
a title with these words can attract users' clicks
for certain information needs.

Category     Attractive Words
Recency      latest, breaking
Importance   official, standard, homepage
Popularity   images, pictures, video, gallery
Others       free, sale, specials, welcome, login

Table 2 Sample of attractive words identified by our t-test. Categories are manually labelled.

4.2.2 URL Attractiveness.

URLs in snippets are also used by end users to select
search results, since URLs can implicitly tell users the
reputation or quality of the landing pages [28]. For example,
a URL with ".edu" in its domain is a good indicator
for academic-related queries. A long URL with
high depth is probably less attractive than a URL with
low depth if a user intends to find some broad information.
We thus define the URL attractiveness features
as shown in Table 1. All the URL features are query
independent. For example, although a URL may not
have received any clicks for a tail query, it is likely
to be clicked again if it has received many clicks in the
search logs. This is captured by our NumViews feature
for URLs. Furthermore, we define a categorical feature
TopLevelDomain in Table 1 to roughly capture the
URL types. Table 3 lists the distribution of the highly
clicked top-level domains identified in our search logs.
The feature TopLevelDomain takes one of the 5 possible
values in Table 3.

TopLevelDomain   com     org     net    edu    others
Percentage       75.26   10.19   4.08   1.89   8.58

Table 3 Distribution of top level domain of highly clicked URLs in our click logs.
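
The word-level test described in Section 4.2.1 can be sketched as follows. The paper does not say which variant of the t-test is used, so this sketch assumes Welch's unequal-variance form and computes only the t statistic; a p-value would additionally require the t distribution (for instance via SciPy's `scipy.stats.ttest_ind`). The function names and the example titles are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import List, Sequence


def indicator_samples(word: str, titles: Sequence[str]) -> List[float]:
    """I(w in T) for every title T in the given set."""
    return [1.0 if word in title.lower().split() else 0.0 for title in titles]


def welch_t_statistic(sample_a: Sequence[float], sample_u: Sequence[float]) -> float:
    """Welch's t statistic for the difference in means of two samples."""
    diff = mean(sample_a) - mean(sample_u)
    se_sq = variance(sample_a) / len(sample_a) + variance(sample_u) / len(sample_u)
    if se_sq == 0.0:
        # Degenerate case: both samples are constant.
        if diff == 0.0:
            return 0.0
        return float("inf") if diff > 0 else float("-inf")
    return diff / sqrt(se_sq)


# Invented titles standing in for the attractive set A and unattractive set U.
titles_a = ["Official site", "Official store", "Official page", "News today"]
titles_u = ["Random blog", "Old forum", "Some page", "Official copy"]
t_official = welch_t_statistic(
    indicator_samples("official", titles_a),
    indicator_samples("official", titles_u),
)
```

A large positive statistic indicates that the word appears more often in titles of attractive URLs than in titles of unattractive ones.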

4.2.3 Matching Attractiveness.

Query-biased snippets are regarded as the most relevant
part of the landing page by snippet generation
methods [53]. The matching fragments of a title, URL
and abstract provide passage-level relevance evidence
between query and documents [5] and also play an important
role in users' evaluation of the relevance of
the landing page. We define a set of matching attractiveness
features in Table 1 in a similar way to the matching
features between queries and whole documents. Our
matching features cover string-level match, token-level
match, matching positions (NumBtwMatch and NumBefMatch),
and matching coherence and proximity (IsSegMatch
and NumBtwMatch), etc. We also include
approximate matches, which are computed based on the
edit distance between query tokens and words in snippets.
These features can capture morphological variants
and acronyms. We discretize the approximate
match into binary values by thresholding. For example,
FracApxMatch is computed as the fraction of query tokens
which have approximate matches in titles or URLs:

$$\mathrm{FracApxMatch} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{ApxMatch}(q, T \cup U)$$

where Q, T, U are the sets of tokens in the query, title and
URL respectively and

$$\mathrm{ApxMatch}(q, S) = \begin{cases} 1 & \text{if } q \text{ approximately matches a token in } S \\ 0 & \text{otherwise.} \end{cases}$$

The longer the token q, the more distance we allow
in approximate matches. This feature has been shown
in our experiments to be important in terms of
discriminative power for prediction.
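
A minimal sketch of FracApxMatch follows. The paper states only that longer tokens are allowed more edit distance; the concrete rule used here (one edit per four characters, in `allowed_distance`) is a hypothetical choice for illustration, as are the function names.

```python
from typing import Iterable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def allowed_distance(token: str) -> int:
    """Hypothetical threshold: longer tokens tolerate more edits."""
    return len(token) // 4


def apx_match(q: str, tokens: Iterable[str]) -> int:
    """1 if q approximately matches any token in the set, else 0."""
    limit = allowed_distance(q)
    return int(any(edit_distance(q, t) <= limit for t in tokens))


def frac_apx_match(query_tokens: Sequence[str], snippet_tokens: Sequence[str]) -> float:
    """Fraction of query tokens with an approximate match in title or URL tokens."""
    if not query_tokens:
        return 0.0
    return sum(apx_match(q, snippet_tokens) for q in query_tokens) / len(query_tokens)
```

For example, with this threshold "flights" matches "flight" (distance 1, limit 1), while "puma" has no approximate match among the tokens "cougar" and "wikipedia".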

Our matching features can also be extended to an
expanded set of queries for a given URL. Though we
have no click information for tail queries, we still have
click information for a candidate URL. From our log, we
can take the set of queries which have led to clicks on the
URL as the expanded set of queries. We can thus compute
the matching attractiveness against this set of queries
and use it as additional snippet features. Let $Q_{exp}(u)$
denote the set of queries for which the URL u has been
viewed and clicked by users in our logs. Given a query
q and a URL u, we define

$$\mathrm{FracMatch\_Expanded}(q, u) = \mathrm{FracMatch}(q, Q_{exp}(u))$$

where FracMatch denotes the fraction of query tokens
found in the expanded query set. For example, given q = "puma
concolor," consider the following URL:

URL: en.wikipedia.org/wiki/Mountain_lion
Title: Cougar - Wikipedia, the free encyclopedia

We have the expanded query set {cougar, mountain
lion, concolor}. Although there is no match between
the original query q and the corresponding URL, i.e.,
FracMatch(q, u) = 0, we have FracMatch_Expanded(q,
u) = 0.5. This makes sense because puma concolor is also
known as cougar or mountain lion, depending on the
region. This example shows that expanded query match
features can deal with some synonym or misspelling
problems effectively.
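
The puma concolor example can be reproduced with a short sketch. The tokenizer (lowercased alphanumeric runs) and the function names are assumptions made for illustration; the paper does not specify how tokens are extracted.

```python
import re
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frac_match(query: str, texts: Iterable[str]) -> float:
    """Fraction of query tokens that appear in any of the given texts."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    pool = set()
    for text in texts:
        pool.update(tokenize(text))
    return sum(1 for tok in query_tokens if tok in pool) / len(query_tokens)


query = "puma concolor"
url = "en.wikipedia.org/wiki/Mountain_lion"
title = "Cougar - Wikipedia, the free encyclopedia"
expanded_queries = ["cougar", "mountain lion", "concolor"]  # Q_exp(u) from the example

plain = frac_match(query, [url, title])
expanded = frac_match(query, expanded_queries)
```

Here `plain` is 0 because neither query token occurs in the URL or title, while `expanded` is 0.5 because "concolor" occurs in the expanded query set.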



5 Leverage Snippet Features 

Given a query, we use $x_i \in \mathbb{R}^d$ and $s_i \in \mathbb{R}^l$ to represent
the original ranking features and the snippet features
for document i. A traditional ranking function $f_{org}: \mathbb{R}^d \to \mathbb{R}$
maps the original ranking features to a real
value, and all the documents for a query are ranked by
$f_{org}$ in descending order. We leverage the snippet features
to predict perceived relevance and enhance the search
result ranking.



5.1 Predict Perceived Relevance 

We train an attractiveness function $f_{attr}: \mathbb{R}^l \to \mathbb{R}$
based on the snippet features $s_i$ and the attractiveness
score $a_i$ estimated using the DBN model. We obtain our
training data by applying our feature definitions and
the DBN model to a set of popular queries with sufficient
click information. Since $f_{attr}$ relies only on a set
of snippet features, it can be applied to tail queries.
We use the GBRank [37] method to find the optimal
$f_{attr}$ that minimizes the following pairwise loss function.
Let $V = \{(s_i, s_j, a_i - a_j)\}$. Our loss function is:

$$\sum_{V} \max\left(0, (a_i - a_j) - (f_{attr}(s_i) - f_{attr}(s_j))\right)^2.$$

The function $f_{attr}$ can be used to predict the perceived
relevance between any query and URL.

5.2 Improve Web Search Ranking 

In this section, we discuss how to leverage our snip- 
pet features to enhance the ranking. We propose two 
strategies and discuss their application scenarios in the 
following. 

5.2.1 Strategy I 

To leverage the snippet features, our first strategy is to
combine the predicted perceived relevance scores
with the original ranking scores to rerank the
top search results. Specifically, we propose the following
scenario to apply our strategy:

- An initial query is issued and the ranking function
  $f_{org}$ is used to select a few top results.

- The snippet generation method receives the selected
  documents. It generates the snippets and also the
  snippet features. Based on the snippet features, $f_{attr}$ is
  used to estimate the perceived relevance.

- The final ranking of search results is based
  on a linear combination of $f_{org}$ and $f_{attr}$:

$$f_I = \lambda \cdot f_{org} + (1 - \lambda) \cdot f_{attr}.$$
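
Strategy I can be sketched as a reranking step over the top results. The tuple layout, the function name `rerank_strategy_one`, and the example scores are hypothetical; the sketch also assumes the two scores are on comparable scales, which the text does not discuss.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (document id, f_org score, f_attr score).
ScoredDoc = Tuple[str, float, float]


def rerank_strategy_one(results: Sequence[ScoredDoc], lam: float) -> List[str]:
    """Order documents by lam * f_org + (1 - lam) * f_attr, highest first."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie in [0, 1]")
    combined = [
        (lam * f_org + (1.0 - lam) * f_attr, doc_id)
        for doc_id, f_org, f_attr in results
    ]
    # sorted() is stable, so ties keep their original relative order.
    return [doc_id for _, doc_id in sorted(combined, key=lambda x: x[0], reverse=True)]


# Invented scores for three top results.
top_results = [("a", 0.9, 0.1), ("b", 0.8, 0.9), ("c", 0.5, 0.6)]
```

Setting `lam = 1` reproduces the original ranking, and smaller values give the predicted perceived relevance more influence.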

5.2.2 Strategy II 

The first strategy is a simple linear combination of
the predicted scores. Our second strategy is to go to
the feature level and expand the ranking features $x_i$
by $s_i$. Thus we form a longer feature vector $[x_i, s_i]$
for each document. We train a new ranking function
$f_{II}: \mathbb{R}^{d+l} \to \mathbb{R}$ on these concatenated vectors. Clearly,
it is difficult to directly apply such a strategy in
a search engine since the snippet features can
be computed only after the snippets are generated. We
thus propose the following scenario in a feedback setting
with two rounds of retrieval.

- An initial query is issued and the ranking function
  $f_{org}$ is used to return the search results and generate
  snippets for the top ranked results.

- We provide users with an additional button "Refresh to
  Improve", which is intended to improve search results
  if a user is not satisfied with the current results
  and clicks the button.

- After the button is pressed, all the snippet features
  are generated for the top results and the new ranking
  function $f_{II}$ is used to generate a new search result
  page.

This strategy can also be applied by search engines
without user intervention. However, such a strategy may be
risky for those queries for which the original ranking is
already good enough. The button "Refresh to Improve"
is a safe alternative when a user is not satisfied with the
current results.



6 Experiments 

We perform two types of experiments. First, we eval- 
uate the performance of our attractiveness prediction. 
Then, we use the predicted attractiveness and the de- 
fined snippet features to improve the ranking accuracy. 

6.1 Predict Perceived Relevance 
6.1.1 Experiment Setup 

We first test the predictive accuracy of our proposed
prediction model for tail queries. A difficulty in this test
is that we cannot obtain the "true" target attractiveness
for tail queries: the estimation of attractiveness
(by click models) is not reliable for tail queries due to
the limited amount of click information. Thus, we
simulate tail queries by sampling only a small subset
of the click logs of non-tail queries. Note that the
target attractiveness is obtained before the sampling.

We obtain click logs from a commercial search engine.
The click log data is a set of sessions. A session is associated
with a unique user and a unique query. It starts
when a user issues a query and ends after 60 minutes of
idle time on the user side. Each session contains the list
of URLs in the search results page and the list of clicked
URLs. We select queries with enough sessions to ensure
reliable target values. After this filtering, we obtain
40M sessions and 20K unique queries. Let this original
set of sessions be S. Then, S is split into the training
set $S_{train}$, the validation set $S_{validation}$ and the test set
$S_{test}$. For all (query, URL) pairs in these sets, we obtain
snippet features and target values (attractiveness
computed by the DBN click model using the full data).
We obtain $S_{test}^{tail}$ by sampling 10 random sessions for each
query in $S_{test}$.

The evaluation is based on comparing the pairwise
attractiveness values predicted by our proposed model
$f_{attr}$ to the "true" pairwise attractiveness values derived
from the DBN click model using the full session
data. For two URLs $u_j$ and $u_k$ for query i, we predict
that URL $u_j$ is more attractive than URL $u_k$ if

$$f_{attr}(x_{i,j}) - f_{attr}(x_{i,k}) > \tau.$$

Then, we test whether $a_{i,j} > a_{i,k}$, where $a_{i,j}$ and $a_{i,k}$ are the
true target attractiveness values computed by the DBN
click model using the full session data $S_{test}$. Hence, this
is a binary classification problem. With different $\tau$
values, we obtain different levels of precision and recall.
Thus, by varying $\tau$, we can get a precision-recall curve.
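
One possible reading of this evaluation is sketched below for a single query. It assumes that every ordered pair of URLs is a candidate, that a pair is predicted positive when the score gap exceeds the threshold, and that recall is measured against all pairs whose true attractiveness is strictly ordered; the paper does not spell out these details. The function name and the toy scores are hypothetical.

```python
from itertools import permutations
from typing import Optional, Sequence, Tuple


def pairwise_precision_recall(
    predicted: Sequence[float],
    target: Sequence[float],
    tau: float,
) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall of pairwise preferences at threshold tau."""
    true_pos = 0
    predicted_pos = 0
    actual_pos = 0
    for j, k in permutations(range(len(predicted)), 2):
        is_actual = target[j] > target[k]                # a_{i,j} > a_{i,k}
        is_predicted = predicted[j] - predicted[k] > tau  # score gap exceeds tau
        actual_pos += is_actual
        predicted_pos += is_predicted
        true_pos += is_actual and is_predicted
    precision = true_pos / predicted_pos if predicted_pos else None
    recall = true_pos / actual_pos if actual_pos else None
    return precision, recall


# Invented scores for three URLs of one query.
toy_predicted = [0.9, 0.5, 0.1]
toy_target = [0.8, 0.2, 0.4]
```

Raising the threshold keeps only the most confident preferences, which typically trades recall for precision; sweeping it yields the precision-recall curve.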

Based on the test data, we compare the predictions
given by the following:

- $a^{tail}$: Attractiveness computed by the DBN click
  model using the sampled session data $S_{test}^{tail}$.

- $f_{snippet}$: Function trained on only snippet features.

- $f_{snippet+click}$: Function trained on both snippet features
  and clicks.

$a^{tail}$ provides the baseline predictions: if $a^{tail}_{i,j} - a^{tail}_{i,k} > \tau$,
we predict that URL $u_j$ is more attractive than
URL $u_k$.

We apply GBRank to train a function $f_{\text{snippet}}$ on the
pairwise training data

$$\mathcal{D} = \{(x_{i,j}, x_{i,k}, a_{i,j} - a_{i,k}) \mid i \in \{1,\dots,N\},\ j,k \in \{1,\dots,10\},\ a_{i,j} > a_{i,k}\}$$

where $a_{i,j}$ and $a_{i,k}$ are the attractiveness values computed by
the DBN click model using the whole session data $S_{\text{train}}$.
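The construction of the pairwise training set can be sketched as below. The nested-list layout of features and targets and the name `build_pairs` are assumptions for illustration; the learner itself (GBRank) is not shown.

```python
from itertools import permutations


def build_pairs(features, attractiveness):
    """Build (x_ij, x_ik, a_ij - a_ik) triples for every pair with a_ij > a_ik.

    features[i][j] is the feature vector of URL j for query i, and
    attractiveness[i][j] is its target attractiveness value.
    """
    pairs = []
    for i, a_i in enumerate(attractiveness):
        for j, k in permutations(range(len(a_i)), 2):
            if a_i[j] > a_i[k]:  # ties produce no preference pair
                pairs.append((features[i][j], features[i][k], a_i[j] - a_i[k]))
    return pairs
```

Each triple stores the preferred URL first, so the third element (the attractiveness gap) is always positive.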

Once we train $f_{\text{snippet}}$ on the training data, it can be used
for new queries for which no click information is available. However,
for tail queries for which some amount of click data is available, we
can combine our attractiveness model and the click information. A
straightforward way to combine the two is to have a linear combination
of the two predictions:

$$\lambda f_{\text{snippet}} + (1 - \lambda)\, a$$

where $a$ is the attractiveness computed by the DBN click model using
the available click data and $\lambda$ depends on the frequency of a
query (the more frequent, the smaller $\lambda$ becomes). However, we
would have to tune $\lambda$ manually or design a heuristic function
for $\lambda$. A more principled way of combining the attractiveness
model and click information is to use the click information as a
feature and let the training procedure figure out the optimal
combination. To this end, we generate another set of session data
$S'_{\text{train}}$ as follows. For each query in $S_{\text{train}}$,
we sample $r\%$ of sessions, where $r$ is randomly selected to ensure
that the sampled session data contains queries with various
frequencies. Then, each feature vector $x_{i,j}$ in our training data
is expanded to include two additional features:

- $a'_{i,j}$: Attractiveness computed by the DBN click model using
$S'_{\text{train}}$

- $\text{session}_i$: The number of sessions for query $i$ in
$S'_{\text{train}}$

Fig. 1 Precision vs. recall of 3 different ways of predicting
attractiveness for tail queries. 'Summary' represents a function
$f_{\text{snippet}}$ trained on only snippet features. 'Click'
represents predictions given by $a^{\text{tail}}$, attractiveness
computed by the DBN click model using a limited amount of click
information. 'Summary and Click' represents a function
$f_{\text{snippet+click}}$ that combines both predictions.

Note that we still use the true attractiveness $a_{i,j}$ (computed by
using the full data $S_{\text{train}}$) as targets. The new pairwise
training data is

$$\mathcal{D} = \{((x_{i,j}, a'_{i,j}, \text{session}_i),\ (x_{i,k}, a'_{i,k}, \text{session}_i),\ a_{i,j} - a_{i,k}) \mid i \in \{1,\dots,N\},\ j,k \in \{1,\dots,10\},\ a_{i,j} > a_{i,k}\}.$$

Then, we train a GBRank function $f_{\text{snippet+click}}$ on this
data.
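The session subsampling and the feature expansion can be sketched as follows. The text only states that the sampling rate is chosen at random, so drawing it uniformly from 0 to 100 per query, and keeping at least one session, are assumptions of this example. The attractiveness on the sampled data is taken as a given input, since the DBN click model itself is outside the scope of the sketch. All function names are hypothetical.

```python
import random


def sample_sessions(sessions_by_query, rng):
    """Keep a random percentage of each query's sessions.

    The rate r is drawn per query from Uniform(0, 100) (an assumption), so
    the sampled data mixes queries with few and with many sessions.
    """
    sampled = {}
    for query, sessions in sessions_by_query.items():
        r = rng.uniform(0.0, 100.0)
        keep = min(len(sessions), max(1, round(len(sessions) * r / 100.0)))
        sampled[query] = rng.sample(sessions, keep)
    return sampled


def expand_features(x, a_sampled, n_sessions):
    """Append the two click-derived features to a snippet feature vector.

    a_sampled: attractiveness estimated on the sampled sessions.
    n_sessions: number of sampled sessions for the query.
    """
    return list(x) + [a_sampled, n_sessions]
```

Because the session count is itself a feature, a tree-based learner can decide how much to trust the click-based attractiveness at each frequency level instead of relying on a hand-tuned weight.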

6.1.2 Experimental Results 

We summarize the precision-recall results of $a^{\text{tail}}$,
$f_{\text{snippet}}$ and $f_{\text{snippet+click}}$ in Figure 1. The
result shows that the combination of our attractiveness model and the
click information clearly outperforms either one alone.

After the training process, we obtain the list of features ordered by
their importance (see [17] for the definition of feature importance).
We have the following observations:

- For $f_{\text{snippet+click}}$, the attractiveness and the number of
sessions are among the top three features in the importance list. When
we look into the decision tree structure, we find that the two features
function together: When the number of sessions is large (i.e., we have
sufficient click information), the attractiveness computed by the DBN
click model should be weighted more than snippet features. On the other
hand, when we have a small number of sessions (i.e., tail queries),
snippet features should play a more important role.

- Length of URL is the second most important feature for
$f_{\text{snippet}}$ and the fourth for $f_{\text{snippet+click}}$,
which agrees with the results by

- Features related to URL and title are more important than those for
abstract.



6.2 Improve Ranking Relevance 

We construct data sets from a commercial search engine to test the
effectiveness of our defined snippet features. The training examples
are labeled using five values, {0,1,2,3,4}, representing five levels of
relevance. Our evaluation is based on $\text{NDCG}_5$ and
$\text{NDCG}_1$. $\text{NDCG}_k$ is defined to be

$$\text{NDCG}_k = \frac{1}{Z_k} \sum_{i=1}^{k} \frac{G_i}{\log_2(i+1)}$$

where $G_i$ is a function of the relevance grade of the document at
rank position $i$ and $Z_k$ represents a normalization factor that
guarantees that the $\text{NDCG}_k$ of the perfect ranking (among the
permutations of the retrieved documents) is 1.
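A small sketch of the metric is given below, assuming the common form with a $\log_2(i+1)$ position discount. The gain function $2^g - 1$ used as the default is an assumption, since the text only says that the gain is a function of the relevance grade. The function names are hypothetical.

```python
import math


def exponential_gain(grade):
    """Assumed gain function: 2^grade - 1 for grades in {0, 1, 2, 3, 4}."""
    return 2 ** grade - 1


def dcg_at_k(grades, k, gain=exponential_gain):
    """Discounted cumulative gain over the top-k grades in ranked order."""
    return sum(gain(g) / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades, k, gain=exponential_gain):
    """DCG divided by Z_k, the DCG of the best permutation of the same documents."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k, gain)
    return dcg_at_k(grades, k, gain) / ideal if ideal > 0 else 0.0
```

For instance, `ndcg_at_k([0, 4], 1)` is 0.0 because the only relevant document sits at rank 2, while a ranking already sorted by grade scores 1.0 at any cutoff.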

We have a conventional data set which has the 20 most informative
original ranking features, including some click-based features, to
train a conventional ranking function. Since we aim at improving
relevance for new or tail queries, we collect (query, URL) pairs which
have no click-related information from the above data set. We treat all
the queries in the resulting data set as tail queries. Table 4 shows
the distribution of the tail queries with respect to query length and
their corresponding search accuracy using our baseline ranking
function. Clearly, long queries cover a large portion of the tail
queries. Furthermore, we can also see that while short tail queries
achieve reasonable accuracy, long queries usually have much worse
search accuracy. This means that the baseline ranking function is less
effective for longer tail queries. Thus, in the following experiments,
we focus on queries with three or more tokens in order to help these
more difficult tail queries. We split the data into training and test
sets. In the training data, we have 202K (query, URL) pairs, resulting
in 2M preference pairs. In the test data, we have 46K (query, URL)
pairs and 545K preference pairs. Since no click information is
available, all the queries in both training and test data can be
regarded as unseen queries.
Query length    Ratio of tail queries    NDCG_5
1 token         8.55%                    0.801
2 tokens        18.0%                    0.720
3 tokens        32.5%                    0.653
4+ tokens       46.3%                    0.588

Table 4 Ratio of tail queries and search quality broken down by
query length.




Fig. 2 Accuracy comparison ($\text{NDCG}_5$) of the baseline, the
strategy I ranking function $f_I$ and the strategy II method $f_{II}$.



To obtain the training data to learn the attractiveness function
$f_{\text{attr}}$, we use the data set used in the previous section.
Each session contains the list of URLs on the search results page and
the list of clicked URLs. We select queries with enough sessions to
ensure reliable target values.

Figure 2 shows the accuracy comparison of the baseline ranking
($f_{\text{org}}$), strategy I ($f_I$), and strategy II ($f_{II}$)
using $\text{NDCG}_5$ as the metric. For all these methods, we tune the
GBRank parameters and $\lambda$ to their optimal values. From this
figure, we can see that both our strategies improve over the baseline
ranking. For example, strategy II improves over the baseline by 0.8%
relatively, and this is statistically significant based on the Wilcoxon
test (p-value < 0.01). Although strategy I is also able to improve over
the baseline, the improvement is not statistically significant.
Comparing the two strategies, strategy II is more effective than
strategy I. This shows that the second strategy of directly training a
new ranking function can better leverage the snippet feature signals.

In Figure 3, we show the impact of the parameter $\lambda$ in strategy
I using both metrics $\text{NDCG}_5$ and $\text{NDCG}_1$. When
$\lambda = 1$, the result is the same as the baseline. From this
figure, we can see that strategy I can only marginally improve over the
baseline method in terms of $\text{NDCG}_5$. However, we observe
significant improvement of strategy I over the baseline in terms of
$\text{NDCG}_1$. For example, when $\lambda = 0.5$, the $\text{NDCG}_1$
of $f_I$ is 0.554, a 2.6% relative improvement over the 0.540 of
$f_{\text{org}}$, and this improvement is also statistically
significant. This means that strategy I is more effective for higher
ranked documents. This also means that the attractiveness scores from
DBN are more accurate in predicting higher ranked results, which is
reasonable because highly ranked documents are less influenced by
position bias.

Fig. 3 The impact of the linear combination factor $\lambda$ of
strategy I.
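The relative improvement quoted for the combination factor 0.5 can be checked directly from the two reported $\text{NDCG}_1$ values, 0.554 for strategy I and 0.540 for the baseline:

```latex
\[
\frac{0.554 - 0.540}{0.540} = \frac{0.014}{0.540} \approx 0.026 = 2.6\%
\]
```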

Overall, we can see that both our strategies are effective in
improving search accuracy. This confirms the effectiveness of our
defined snippet features.



7 Conclusions and Future Work 

In this paper, we studied how to model perceived relevance for tail
queries without relying on any click-through data. We developed a set
of snippet features to capture the attractiveness or perceived
relevance of Web search results and proposed two novel strategies to
leverage these snippet features to improve tail queries. We showed that
the two strategies can be naturally incorporated into a search process.
We conducted experiments on a large data set from a commercial search
engine. Our results confirm that the defined snippet features are able
to predict the perceived relevance effectively. Furthermore, the search
accuracy of tail queries can be significantly improved by using the
snippet features.

Our work is one of the few studies on directly improving search
accuracy for tail queries. In the future, one interesting direction is
to provide a unified framework to jointly model both clicks and snippet
features so that information from head queries can be propagated to
tail queries in a more principled way. A main challenge for tail
queries is the lack of user feedback, and a possible direction is to
leverage the relations between queries, such as a query graph, to
better capture the attractiveness of search results for tail queries.